Generating Cross-lingual Concept Space from Parallel Corpora on the Web
Abstract
The information available in languages other than English on the World Wide Web is increasing significantly. Dictionaries are the most typical tools for crossing language boundaries. However, general-purpose dictionaries are less sensitive to genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not share these limitations of dictionaries, provide a statistical translation model for crossing the language boundary. The objective of this research is to automatically mine English/Chinese parallel documents from the World Wide Web and to generate a cross-lingual concept space for cross-lingual information retrieval. An alignment method based on dynamic programming identifies the one-to-one Chinese and English title pairs used to build the parallel corpus. A Hopfield network is then employed to generate the cross-lingual concept space based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output is a thesaurus-like semantic network knowledge base that can aid semantics-based cross-lingual information management and retrieval.
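The abstract does not give the details of the Hopfield network step, so the following is only a minimal sketch of one common way such a network is used over a term-association graph: spreading activation with a sigmoid transfer function, iterated until the activations stop changing. The function name `spread_activation`, the thresholds `theta_j` and `theta_0`, the decision to clamp the seed terms, and the toy English/Chinese association weights are all assumptions made for illustration; none of them come from the paper.

```python
import math


def spread_activation(weights, seeds, theta_j=0.5, theta_0=0.1,
                      eps=1e-4, max_iter=100):
    """Hopfield-style spreading activation over a weighted term graph.

    weights[i][j] is the (hypothetical) association weight from term i
    to term j. Seed terms are clamped to 1.0; every other term is
    updated with a sigmoid of its weighted input until convergence.
    """
    terms = sorted(weights)
    act = {t: (1.0 if t in seeds else 0.0) for t in terms}
    for _ in range(max_iter):
        new = {}
        for j in terms:
            if j in seeds:
                new[j] = 1.0  # keep the query terms fully active
                continue
            # Weighted sum of the activations of all terms linked to j.
            net = sum(weights[i].get(j, 0.0) * act[i] for i in terms)
            new[j] = 1.0 / (1.0 + math.exp(-(net - theta_j) / theta_0))
        delta = sum(abs(new[t] - act[t]) for t in terms)
        act = new
        if delta < eps:  # activations have stabilised
            break
    return act


# Toy, made-up association weights between English and Chinese terms.
weights = {
    "trade": {"貿易": 0.9, "export": 0.6},
    "貿易": {"trade": 0.9, "出口": 0.7},
    "export": {"出口": 0.8, "trade": 0.5},
    "出口": {"export": 0.8},
    "weather": {"天氣": 0.9},
    "天氣": {"weather": 0.9},
}

activation = spread_activation(weights, {"trade"})
for term in sorted(activation, key=activation.get, reverse=True):
    print(term, round(activation[term], 3))
```

In this sketch, terms that end with high activation are those reachable from the seed through strong associations, in either language, while unrelated terms stay near zero. In a real system the weights would be estimated from co-occurrence statistics in the aligned parallel corpus rather than written by hand.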
Similar resources
An Evaluation of the Concept Retrieval Annotation for Spanish-English CLEFER Parallel Corpora
This paper presents a study about the use of the concept retrieval annotation method for parallel corpora. The concept retrieval annotation method (CRA) consists of considering concepts as documents and text chunks as queries [1]. Concepts with higher similarity to text chunks are considered for generating the final semantic annotation. CRA makes use of an existing knowledge resource (KR) from ...
An associate constraint network approach to extract multi-lingual information for crime analysis
International crime and terrorism have drawn increasing attention in recent years. Retrieving relevant information from criminal records and suspect communications is important in combating international crime and terrorism. However, most of this information is written in languages other than English and is stored in various locations. Information sharing between countries therefore presents th...
Finding Translation Examples for Under-Resourced Language Pairs or for Narrow Domains; the Case for Machine Translation
Cyberspace is populated with valuable information sources, expressed in about 1500 different languages and dialects. Yet, for the vast majority of Web surfers this wealth of information is practically inaccessible or meaningless. Recent advancements in cross-lingual information retrieval, multilingual summarization, cross-lingual question answering and machine translation promise to narrow ...
Learning Cross-lingual Word Embeddings via Matrix Co-factorization
A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix cofactorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing monolingual matric...
Constructing Chinese-English Concept Space
The information available in languages other than English on the World Wide Web is increasing significantly. According to a report from Computer Economics [1], 54% of Internet users are English speaking. However, it is predicted that there will be only a 60% increase in Internet users among English speakers but 150% growth among non-English speakers over the next five years. By 2005, ...